This is my attempt to create a document out of the inclass coding scripts.
Step 1 : Load tidyverse library and read the data
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 1.0.1
## ✔ tibble 3.1.8 ✔ dplyr 1.1.0
## ✔ tidyr 1.3.0 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 1.0.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
indata <- read.csv("~/Desktop/FertilityRates.csv")
Step 2: Need to check if the data loaded correctly or not.
dim(indata)
## [1] 219 56
From the above output we can see there are 219 rows and 56 columns
Step 3 : Check the first 6th rows of the data by using a function head() and for the last 6 columns we need to use a function tail()
head(indata)
## Country.Name Country.Code Indicator.Name
## 1 Aruba ABW Fertility rate, total (births per woman)
## 2 Andorra AND Fertility rate, total (births per woman)
## 3 Afghanistan AFG Fertility rate, total (births per woman)
## 4 Angola AGO Fertility rate, total (births per woman)
## 5 Albania ALB Fertility rate, total (births per woman)
## 6 United Arab Emirates ARE Fertility rate, total (births per woman)
## Indicator.Code X1960 X1961 X1962 X1963 X1964 X1965 X1966 X1967 X1968 X1969
## 1 SP.DYN.TFRT.IN 4.820 4.655 4.471 4.271 4.059 3.842 3.625 3.417 3.226 3.054
## 2 SP.DYN.TFRT.IN NA NA NA NA NA NA NA NA NA NA
## 3 SP.DYN.TFRT.IN 7.671 7.671 7.671 7.671 7.671 7.671 7.671 7.671 7.671 7.671
## 4 SP.DYN.TFRT.IN 7.316 7.354 7.385 7.410 7.425 7.430 7.422 7.403 7.375 7.339
## 5 SP.DYN.TFRT.IN 6.186 6.076 5.956 5.833 5.711 5.594 5.483 5.376 5.268 5.160
## 6 SP.DYN.TFRT.IN 6.928 6.910 6.893 6.877 6.861 6.841 6.816 6.783 6.738 6.679
## X1970 X1971 X1972 X1973 X1974 X1975 X1976 X1977 X1978 X1979 X1980 X1981 X1982
## 1 2.908 2.788 2.691 2.613 2.552 2.506 2.472 2.446 2.425 2.408 2.392 2.377 2.364
## 2 NA NA NA NA NA NA NA NA NA NA NA NA NA
## 3 7.671 7.671 7.671 7.671 7.671 7.671 7.670 7.670 7.670 7.669 7.669 7.670 7.671
## 4 7.301 7.264 7.232 7.208 7.192 7.185 7.186 7.189 7.194 7.197 7.200 7.201 7.203
## 5 5.050 4.933 4.809 4.677 4.538 4.393 4.244 4.094 3.947 3.807 3.678 3.562 3.460
## 6 6.605 6.512 6.402 6.279 6.146 6.009 5.873 5.744 5.624 5.517 5.423 5.344 5.274
## X1983 X1984 X1985 X1986 X1987 X1988 X1989 X1990 X1991 X1992 X1993 X1994 X1995
## 1 2.353 2.342 2.332 2.320 2.307 2.291 2.272 2.249 2.221 2.187 2.149 2.108 2.064
## 2 NA NA NA NA NA NA NA NA NA NA NA NA NA
## 3 7.673 7.676 7.679 7.681 7.682 7.682 7.682 7.687 7.700 7.725 7.758 7.796 7.832
## 4 7.205 7.207 7.208 7.206 7.202 7.194 7.182 7.165 7.143 7.116 7.087 7.054 7.019
## 5 3.372 3.297 3.233 3.177 3.126 3.075 3.023 2.970 2.917 2.867 2.819 2.772 2.723
## 6 5.209 5.141 5.065 4.973 4.860 4.724 4.566 4.388 4.193 3.989 3.784 3.583 3.393
## X1996 X1997 X1998 X1999 X2000 X2001 X2002 X2003 X2004 X2005 X2006 X2007 X2008
## 1 2.021 1.979 1.940 1.905 1.874 1.848 1.825 1.805 1.786 1.769 1.754 1.739 1.726
## 2 NA NA NA NA NA NA NA NA NA NA 1.240 1.180 1.250
## 3 7.859 7.869 7.854 7.809 7.733 7.623 7.484 7.321 7.136 6.930 6.702 6.456 6.196
## 4 6.984 6.949 6.913 6.878 6.844 6.811 6.778 6.743 6.704 6.657 6.598 6.523 6.434
## 5 2.670 2.611 2.543 2.467 2.383 2.291 2.195 2.097 2.004 1.919 1.849 1.796 1.761
## 6 3.215 3.052 2.902 2.766 2.644 2.532 2.428 2.329 2.236 2.149 2.071 2.004 1.948
## X2009 X2010 X2011
## 1 1.713 1.701 1.690
## 2 1.190 1.220 NA
## 3 5.928 5.659 5.395
## 4 6.331 6.218 6.099
## 5 1.744 1.741 1.748
## 6 1.903 1.868 1.841
tail(indata)
## Country.Name Country.Code Indicator.Name
## 214 Samoa WSM Fertility rate, total (births per woman)
## 215 Yemen, Rep. YEM Fertility rate, total (births per woman)
## 216 South Africa ZAF Fertility rate, total (births per woman)
## 217 Congo, Dem. Rep. COD Fertility rate, total (births per woman)
## 218 Zambia ZMB Fertility rate, total (births per woman)
## 219 Zimbabwe ZWE Fertility rate, total (births per woman)
## Indicator.Code X1960 X1961 X1962 X1963 X1964 X1965 X1966 X1967 X1968 X1969
## 214 SP.DYN.TFRT.IN 7.651 7.645 7.630 7.604 7.567 7.519 7.462 7.400 7.334 7.265
## 215 SP.DYN.TFRT.IN 7.292 7.308 7.328 7.354 7.385 7.418 7.447 7.471 7.491 7.512
## 216 SP.DYN.TFRT.IN 6.173 6.144 6.103 6.049 5.984 5.911 5.836 5.765 5.700 5.643
## 217 SP.DYN.TFRT.IN 6.001 6.015 6.030 6.048 6.067 6.089 6.111 6.135 6.161 6.187
## 218 SP.DYN.TFRT.IN 7.018 7.071 7.127 7.184 7.240 7.292 7.338 7.376 7.405 7.425
## 219 SP.DYN.TFRT.IN 7.158 7.215 7.267 7.311 7.347 7.373 7.391 7.403 7.411 7.415
## X1970 X1971 X1972 X1973 X1974 X1975 X1976 X1977 X1978 X1979 X1980 X1981
## 214 7.194 7.119 7.039 6.952 6.859 6.761 6.656 6.547 6.434 6.320 6.203 6.086
## 215 7.542 7.593 7.672 7.782 7.922 8.089 8.277 8.474 8.667 8.843 8.993 9.108
## 216 5.591 5.539 5.482 5.415 5.338 5.251 5.159 5.064 4.969 4.877 4.786 4.695
## 217 6.214 6.242 6.271 6.301 6.333 6.366 6.403 6.443 6.487 6.535 6.585 6.636
## 218 7.437 7.443 7.446 7.447 7.444 7.435 7.414 7.378 7.326 7.259 7.182 7.098
## 219 7.417 7.419 7.420 7.417 7.410 7.395 7.369 7.330 7.273 7.196 7.095 6.967
## X1982 X1983 X1984 X1985 X1986 X1987 X1988 X1989 X1990 X1991 X1992 X1993
## 214 5.968 5.850 5.734 5.620 5.510 5.404 5.303 5.208 5.118 5.034 4.956 4.882
## 215 9.185 9.223 9.223 9.186 9.119 9.030 8.925 8.805 8.667 8.504 8.311 8.088
## 216 4.602 4.504 4.402 4.293 4.177 4.052 3.923 3.791 3.658 3.530 3.408 3.296
## 217 6.689 6.742 6.794 6.847 6.902 6.960 7.019 7.077 7.133 7.183 7.223 7.251
## 218 7.015 6.937 6.865 6.799 6.737 6.673 6.606 6.537 6.468 6.401 6.340 6.288
## 219 6.811 6.633 6.435 6.223 6.004 5.784 5.569 5.365 5.176 5.001 4.840 4.690
## X1994 X1995 X1996 X1997 X1998 X1999 X2000 X2001 X2002 X2003 X2004 X2005
## 214 4.815 4.751 4.692 4.637 4.587 4.541 4.503 4.476 4.460 4.454 4.456 4.460
## 215 7.841 7.578 7.310 7.049 6.802 6.574 6.363 6.166 5.975 5.782 5.588 5.393
## 216 3.196 3.110 3.040 2.983 2.937 2.899 2.866 2.834 2.801 2.763 2.721 2.675
## 217 7.267 7.267 7.253 7.227 7.190 7.143 7.089 7.027 6.960 6.887 6.809 6.728
## 218 6.245 6.209 6.179 6.152 6.126 6.098 6.071 6.044 6.018 5.995 5.974 5.954
## 219 4.554 4.432 4.328 4.240 4.169 4.112 4.069 4.039 4.018 4.002 3.987 3.969
## X2006 X2007 X2008 X2009 X2010 X2011
## 214 4.460 4.450 4.426 4.389 4.338 4.277
## 215 5.199 5.010 4.829 4.658 4.498 4.348
## 216 2.627 2.580 2.538 2.500 2.467 2.438
## 217 6.642 6.550 6.454 6.354 6.251 6.146
## 218 5.932 5.908 5.881 5.849 5.813 5.773
## 219 3.941 3.903 3.853 3.792 3.721 3.643
Data looks as expected
Step 4 : Check whether it is a data frame or a tibble?
If it is a data frame we will get an output as True. In case it is a tibble then the output will be False, so we need to convert it into tibble by using a function called as_tibble
is.data.frame(indata)
## [1] TRUE
We received an output as True it means it is a data frame.
Step 5 : Check the data is tibble or not?
is_tibble(indata)
## [1] FALSE
We can see that we got an output as False so it is not a data frame.
Step 6 : Convert it into a tibble by using a function called as_tibble
indata <- as_tibble(indata)
is_tibble(indata)
## [1] TRUE
After converting, we received an output as True so it is a tibble.
Step 7 : Need to check the summary of data
summary(indata)
## Country.Name Country.Code Indicator.Name Indicator.Code
## Length:219 Length:219 Length:219 Length:219
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## X1960 X1961 X1962 X1963
## Min. :1.940 Min. :1.940 Min. :1.790 Min. :1.810
## 1st Qu.:4.210 1st Qu.:4.027 1st Qu.:4.123 1st Qu.:4.057
## Median :6.179 Median :6.144 Median :6.122 Median :6.104
## Mean :5.512 Mean :5.492 Mean :5.492 Mean :5.488
## 3rd Qu.:6.803 3rd Qu.:6.803 3rd Qu.:6.821 3rd Qu.:6.821
## Max. :8.187 Max. :8.194 Max. :8.197 Max. :8.198
## NA's :25 NA's :24 NA's :25 NA's :26
## X1964 X1965 X1966 X1967
## Min. :1.790 Min. :1.740 Min. :1.580 Min. :1.800
## 1st Qu.:3.950 1st Qu.:3.821 1st Qu.:3.644 1st Qu.:3.549
## Median :6.061 Median :6.079 Median :6.045 Median :5.995
## Mean :5.442 Mean :5.392 Mean :5.338 Mean :5.294
## 3rd Qu.:6.801 3rd Qu.:6.799 3rd Qu.:6.795 3rd Qu.:6.747
## Max. :8.198 Max. :8.198 Max. :8.198 Max. :8.201
## NA's :25 NA's :25 NA's :25 NA's :25
## X1968 X1969 X1970 X1971
## Min. :1.830 Min. :1.851 Min. :1.828 Min. :1.703
## 1st Qu.:3.394 1st Qu.:3.247 1st Qu.:3.093 1st Qu.:2.999
## Median :5.912 Median :5.798 Median :5.745 Median :5.681
## Mean :5.240 Mean :5.187 Mean :5.134 Mean :5.074
## 3rd Qu.:6.738 3rd Qu.:6.687 3rd Qu.:6.693 3rd Qu.:6.665
## Max. :8.207 Max. :8.217 Max. :8.231 Max. :8.252
## NA's :25 NA's :25 NA's :25 NA's :24
## X1972 X1973 X1974 X1975
## Min. :1.593 Min. :1.504 Min. :1.510 Min. :1.450
## 1st Qu.:3.026 1st Qu.:2.943 1st Qu.:2.851 1st Qu.:2.712
## Median :5.521 Median :5.416 Median :5.340 Median :5.234
## Mean :5.022 Mean :4.962 Mean :4.904 Mean :4.838
## 3rd Qu.:6.713 3rd Qu.:6.710 3rd Qu.:6.676 3rd Qu.:6.674
## Max. :8.278 Max. :8.307 Max. :8.339 Max. :8.370
## NA's :23 NA's :25 NA's :25 NA's :25
## X1976 X1977 X1978 X1979
## Min. :1.440 Min. :1.400 Min. :1.380 Min. :1.380
## 1st Qu.:2.591 1st Qu.:2.506 1st Qu.:2.472 1st Qu.:2.453
## Median :5.159 Median :5.093 Median :5.022 Median :4.884
## Mean :4.791 Mean :4.713 Mean :4.657 Mean :4.610
## 3rd Qu.:6.635 3rd Qu.:6.582 3rd Qu.:6.535 3rd Qu.:6.483
## Max. :8.399 Max. :8.474 Max. :8.667 Max. :8.843
## NA's :25 NA's :25 NA's :25 NA's :25
## X1980 X1981 X1982 X1983
## Min. :1.440 Min. :1.430 Min. :1.410 Min. :1.330
## 1st Qu.:2.401 1st Qu.:2.370 1st Qu.:2.390 1st Qu.:2.346
## Median :4.755 Median :4.607 Median :4.505 Median :4.456
## Mean :4.559 Mean :4.495 Mean :4.436 Mean :4.393
## 3rd Qu.:6.439 3rd Qu.:6.364 3rd Qu.:6.309 3rd Qu.:6.272
## Max. :8.993 Max. :9.108 Max. :9.185 Max. :9.223
## NA's :25 NA's :23 NA's :20 NA's :23
## X1984 X1985 X1986 X1987
## Min. :1.290 Min. :1.370 Min. :1.340 Min. :1.280
## 1st Qu.:2.294 1st Qu.:2.301 1st Qu.:2.257 1st Qu.:2.274
## Median :4.359 Median :4.224 Median :4.112 Median :3.985
## Mean :4.340 Mean :4.279 Mean :4.219 Mean :4.149
## 3rd Qu.:6.220 3rd Qu.:6.186 3rd Qu.:6.061 3rd Qu.:5.905
## Max. :9.223 Max. :9.186 Max. :9.119 Max. :9.030
## NA's :23 NA's :23 NA's :23 NA's :19
## X1988 X1989 X1990 X1991
## Min. :1.320 Min. :1.280 Min. :1.260 Min. :1.270
## 1st Qu.:2.232 1st Qu.:2.200 1st Qu.:2.180 1st Qu.:2.121
## Median :3.908 Median :3.752 Median :3.558 Median :3.485
## Mean :4.099 Mean :4.024 Mean :3.956 Mean :3.875
## 3rd Qu.:5.840 3rd Qu.:5.727 3rd Qu.:5.577 3rd Qu.:5.442
## Max. :8.925 Max. :8.805 Max. :8.667 Max. :8.504
## NA's :23 NA's :23 NA's :20 NA's :20
## X1992 X1993 X1994 X1995
## Min. :1.290 Min. :1.250 Min. :1.200 Min. :1.180
## 1st Qu.:2.120 1st Qu.:2.027 1st Qu.:1.960 1st Qu.:1.879
## Median :3.330 Median :3.279 Median :3.147 Median :3.082
## Mean :3.791 Mean :3.729 Mean :3.645 Mean :3.561
## 3rd Qu.:5.259 3rd Qu.:5.191 3rd Qu.:5.056 3rd Qu.:4.956
## Max. :8.311 Max. :8.088 Max. :7.841 Max. :7.832
## NA's :18 NA's :21 NA's :20 NA's :18
## X1996 X1997 X1998 X1999
## Min. :1.150 Min. :1.090 Min. :1.017 Min. :0.982
## 1st Qu.:1.913 1st Qu.:1.854 1st Qu.:1.806 1st Qu.:1.796
## Median :2.989 Median :2.869 Median :2.855 Median :2.804
## Mean :3.517 Mean :3.422 Mean :3.379 Mean :3.337
## 3rd Qu.:4.864 3rd Qu.:4.636 3rd Qu.:4.593 3rd Qu.:4.545
## Max. :7.859 Max. :7.869 Max. :7.854 Max. :7.809
## NA's :21 NA's :17 NA's :20 NA's :19
## X2000 X2001 X2002 X2003
## Min. :0.939 Min. :0.891 Min. :0.856 Min. :0.838
## 1st Qu.:1.778 1st Qu.:1.780 1st Qu.:1.759 1st Qu.:1.770
## Median :2.679 Median :2.614 Median :2.558 Median :2.524
## Mean :3.252 Mean :3.199 Mean :3.134 Mean :3.103
## 3rd Qu.:4.352 3rd Qu.:4.268 3rd Qu.:4.134 3rd Qu.:4.050
## Max. :7.733 Max. :7.704 Max. :7.681 Max. :7.658
## NA's :17 NA's :18 NA's :15 NA's :17
## X2004 X2005 X2006 X2007
## Min. :0.836 Min. :0.849 Min. :0.874 Min. :0.906
## 1st Qu.:1.782 1st Qu.:1.775 1st Qu.:1.800 1st Qu.:1.797
## Median :2.518 Median :2.496 Median :2.425 Median :2.422
## Mean :3.075 Mean :3.038 Mean :2.995 Mean :2.961
## 3rd Qu.:3.987 3rd Qu.:3.980 3rd Qu.:3.911 3rd Qu.:3.818
## Max. :7.636 Max. :7.617 Max. :7.602 Max. :7.593
## NA's :18 NA's :16 NA's :14 NA's :13
## X2008 X2009 X2010 X2011
## Min. :0.939 Min. :0.973 Min. :1.003 Min. :1.031
## 1st Qu.:1.796 1st Qu.:1.800 1st Qu.:1.800 1st Qu.:1.796
## Median :2.384 Median :2.374 Median :2.344 Median :2.334
## Mean :2.935 Mean :2.904 Mean :2.876 Mean :2.854
## 3rd Qu.:3.733 3rd Qu.:3.691 3rd Qu.:3.665 3rd Qu.:3.633
## Max. :7.588 Max. :7.585 Max. :7.584 Max. :7.581
## NA's :14 NA's :14 NA's :15 NA's :17
From the above output we can see that summary of the data such as mean, median, 1st quartile value, 3rd quartile value.
Step 8 : Need to convert Country Name into a factor by using as.factor()
indata$Country.Name <- as.factor(indata$Country.Name)
summary(indata$Country.Name)
## Afghanistan Albania Algeria
## 1 1 1
## American Samoa Andorra Angola
## 1 1 1
## Antigua and Barbuda Argentina Armenia
## 1 1 1
## Aruba Australia Austria
## 1 1 1
## Azerbaijan Bahamas, The Bahrain
## 1 1 1
## Bangladesh Barbados Belarus
## 1 1 1
## Belgium Belize Benin
## 1 1 1
## Bermuda Bhutan Bolivia
## 1 1 1
## Bosnia and Herzegovina Botswana Brazil
## 1 1 1
## Brunei Darussalam Bulgaria Burkina Faso
## 1 1 1
## Burundi Cabo Verde Cambodia
## 1 1 1
## Cameroon Canada Cayman Islands
## 1 1 1
## Central African Republic Chad Channel Islands
## 1 1 1
## Chile China Colombia
## 1 1 1
## Comoros Congo, Dem. Rep. Congo, Rep.
## 1 1 1
## Costa Rica Cote d'Ivoire Croatia
## 1 1 1
## Cuba Curacao Cyprus
## 1 1 1
## Czech Republic Denmark Djibouti
## 1 1 1
## Dominica Dominican Republic Ecuador
## 1 1 1
## Egypt, Arab Rep. El Salvador Equatorial Guinea
## 1 1 1
## Eritrea Estonia Ethiopia
## 1 1 1
## Faeroe Islands Fiji Finland
## 1 1 1
## France French Polynesia Gabon
## 1 1 1
## Gambia, The Georgia Germany
## 1 1 1
## Ghana Greece Greenland
## 1 1 1
## Grenada Guam Guatemala
## 1 1 1
## Guinea Guinea-Bissau Guyana
## 1 1 1
## Haiti Honduras Hong Kong SAR, China
## 1 1 1
## Hungary Iceland India
## 1 1 1
## Indonesia Iran, Islamic Rep. Iraq
## 1 1 1
## Ireland Isle of Man Israel
## 1 1 1
## Italy Jamaica Japan
## 1 1 1
## Jordan Kazakhstan Kenya
## 1 1 1
## (Other)
## 120
From the above output we can see that summary of country name and converted country name into a factor
Step 9 :Need to convert Indicator Name into a factor by using the as.factor()
indata$Indicator.Name <- as.factor(indata$Indicator.Name)
summary(indata$Indicator.Name)
## Fertility rate, total (births per woman)
## 219
From the above output we can see that summary of Indicator name and converted country name into a factor
Step 10 : Need to remove extra columns
indata_cleaned <- select(indata, -(Country.Code:Indicator.Code))
head(indata_cleaned)
## # A tibble: 6 × 53
## Country.Name X1960 X1961 X1962 X1963 X1964 X1965 X1966 X1967 X1968 X1969 X1970
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Aruba 4.82 4.66 4.47 4.27 4.06 3.84 3.62 3.42 3.23 3.05 2.91
## 2 Andorra NA NA NA NA NA NA NA NA NA NA NA
## 3 Afghanistan 7.67 7.67 7.67 7.67 7.67 7.67 7.67 7.67 7.67 7.67 7.67
## 4 Angola 7.32 7.35 7.38 7.41 7.42 7.43 7.42 7.40 7.38 7.34 7.30
## 5 Albania 6.19 6.08 5.96 5.83 5.71 5.59 5.48 5.38 5.27 5.16 5.05
## 6 United Arab… 6.93 6.91 6.89 6.88 6.86 6.84 6.82 6.78 6.74 6.68 6.60
## # … with 41 more variables: X1971 <dbl>, X1972 <dbl>, X1973 <dbl>, X1974 <dbl>,
## # X1975 <dbl>, X1976 <dbl>, X1977 <dbl>, X1978 <dbl>, X1979 <dbl>,
## # X1980 <dbl>, X1981 <dbl>, X1982 <dbl>, X1983 <dbl>, X1984 <dbl>,
## # X1985 <dbl>, X1986 <dbl>, X1987 <dbl>, X1988 <dbl>, X1989 <dbl>,
## # X1990 <dbl>, X1991 <dbl>, X1992 <dbl>, X1993 <dbl>, X1994 <dbl>,
## # X1995 <dbl>, X1996 <dbl>, X1997 <dbl>, X1998 <dbl>, X1999 <dbl>,
## # X2000 <dbl>, X2001 <dbl>, X2002 <dbl>, X2003 <dbl>, X2004 <dbl>, …
We can observe that we have removed the columns.
Step 11 : Convert wider dataset into a longer dataset by using pivot_longer
indata_pivoted <- pivot_longer(indata_cleaned, c(str_c("X", 1960:2011)), names_to = "Year", values_to="Fertility.Rates")
head(indata_pivoted)
## # A tibble: 6 × 3
## Country.Name Year Fertility.Rates
## <fct> <chr> <dbl>
## 1 Aruba X1960 4.82
## 2 Aruba X1961 4.66
## 3 Aruba X1962 4.47
## 4 Aruba X1963 4.27
## 5 Aruba X1964 4.06
## 6 Aruba X1965 3.84
From the result we can observe each year Fertility rates
Step 12 : While formating the year values we have to remove extra characters from it after that we need to convert into an integer by using as.integer
indata_pivoted$Year <- as.integer(str_sub(indata_pivoted$Year, 2,5))
The output we observe that the characters has been removed successfully.
Step 13 : Need to check how many missing values are there in Fertility.Rates column by using as.na()
sum(is.na(indata_pivoted$Fertility.Rates))
## [1] 1104
Step 14 : Group missing values by country and filter missing values after that group by country and finally count the number of values.
indata_pivoted %>% filter(is.na(Fertility.Rates)) %>% group_by(Country.Name) %>% summarise(count = n())
## # A tibble: 27 × 2
## Country.Name count
## <fct> <int>
## 1 American Samoa 52
## 2 Andorra 47
## 3 Bermuda 43
## 4 Cayman Islands 52
## 5 Curacao 47
## 6 Dominica 45
## 7 Faeroe Islands 52
## 8 Greenland 30
## 9 Isle of Man 49
## 10 Kosovo 21
## # … with 17 more rows
Filtering fertility rates and grouping by country name.
Step 15 : Within the countries, we need to fill missing values by using downup()
indata_filled <- indata_pivoted %>% group_by(Country.Name) %>% fill(Fertility.Rates, .direction = "downup") %>% ungroup()
Step 16 : We have to check whether the missing values are completely removed or not.
indata_filled %>% filter(is.na(Fertility.Rates)) %>% group_by(Country.Name) %>% summarise(count = n())
## # A tibble: 9 × 2
## Country.Name count
## <fct> <int>
## 1 American Samoa 52
## 2 Cayman Islands 52
## 3 Faeroe Islands 52
## 4 Monaco 52
## 5 Northern Mariana Islands 52
## 6 San Marino 52
## 7 Sub-Saharan Africa (IFC classification) 52
## 8 Turks and Caicos Islands 52
## 9 Tuvalu 52
Step 17 : Filter all countries with missing values and removing them
indata_filled <- indata_filled %>% filter(!is.na(Fertility.Rates))
Step 18 : Plot a scatter plot, where x-axis = Year and Y axis = Fertility.Rates
ggplot(data = indata_filled) + geom_point(mapping = aes(x=Year, y = Fertility.Rates))
From the above chart is not visually clear need to improve it’s effectiveness
Step 19 : Plot a scatter plot with a smoothing line
ggplot(data = indata_filled) + geom_point(mapping = aes(x=Year, y = Fertility.Rates), position = "jitter", alpha = 0.1) + geom_smooth(mapping = aes(x=Year, y = Fertility.Rates)) + labs(x = "Year", y = "Fertility Rates", title = "Fertility Rates over the years", subtitle = "Global fertility rates have decreased since 1960")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Declining Fertility rates are observe with the help of a regression line
Step 20 : Need to categorize 210 countries by two ways option1 : So many data points are there, we need to reduce some of them so we are subsetting the data.
indata_subset <- filter(indata_filled, Country.Name %in% c("United States", "Mexico", "Canada"))
ggplot(data = indata_subset) + geom_line(mapping = aes(x=Year, y = Fertility.Rates, color = Country.Name), alpha = 0.5) + labs(x = "Year", y = "Fertility Rates", title = "Fertility Rates over the years", subtitle = "Global fertility rates have decreased since 1960")
Global fertility rates of these three countries are decreasing year by year from 1960.
Option 2 : 210 countries categorize by selecting based on statistics and printing 10 countries by using top_n(10)
indata_filled %>% group_by(Country.Name) %>% summarise(avg = mean(Fertility.Rates)) %>% arrange(desc(avg)) %>% print(n = 10)
## # A tibble: 210 × 2
## Country.Name avg
## <fct> <dbl>
## 1 Niger 7.59
## 2 Afghanistan 7.47
## 3 Yemen, Rep. 7.43
## 4 Somalia 7.26
## 5 Rwanda 7.23
## 6 Burundi 7.19
## 7 Angola 7.06
## 8 Uganda 6.94
## 9 Mali 6.92
## 10 Chad 6.92
## # … with 200 more rows
Step 21 : Print bottom 10 countries by using top_n(-10)
indata_filled %>% group_by(Country.Name) %>% summarise(avg = mean(Fertility.Rates)) %>% arrange(desc(avg)) %>% top_n(-10) %>% ggplot() + geom_bar(mapping = aes(x = Country.Name, y = avg), stat = "identity") +coord_flip()
## Selecting by avg
From the above graph, we can observe the difference in average of bottom different countries, switzerland has the highest average while andorra has the least average
Step 22 : If there is no countrycode library then need to install it.
library(countrycode)
Step 23 : Need to do optioning Adding a new variable called continent and grouping them by country name
indata_df <- as.data.frame(indata_filled)
indata_df$Continent.Name <- factor(countrycode(sourcevar = indata_df[,"Country.Name"], origin = "country.name", destination = "continent"))
## Warning in countrycode_convert(sourcevar = sourcevar, origin = origin, destination = dest, : Some values were not matched unambiguously: Channel Islands, Kosovo, Latin America & Caribbean (all income levels), OECD members, Other small states, Pacific island small states
Step 24 : Adding a new variable called region and grouping them by country
indata_df$Region.Name <- factor(countrycode(sourcevar = indata_df[,"Country.Name"], origin = "country.name", destination = "region"))
## Warning in countrycode_convert(sourcevar = sourcevar, origin = origin, destination = dest, : Some values were not matched unambiguously: Latin America & Caribbean (all income levels), OECD members, Other small states, Pacific island small states
Step 25 : Here, in the result we can observe countries divided by continent name and region name
head(indata_df)
## Country.Name Year Fertility.Rates Continent.Name Region.Name
## 1 Aruba 1960 4.820 Americas Latin America & Caribbean
## 2 Aruba 1961 4.655 Americas Latin America & Caribbean
## 3 Aruba 1962 4.471 Americas Latin America & Caribbean
## 4 Aruba 1963 4.271 Americas Latin America & Caribbean
## 5 Aruba 1964 4.059 Americas Latin America & Caribbean
## 6 Aruba 1965 3.842 Americas Latin America & Caribbean
Step 26 : Convert data into tibble by using as_tibble
indata <- as_tibble(indata_df)
head(indata)
## # A tibble: 6 × 5
## Country.Name Year Fertility.Rates Continent.Name Region.Name
## <fct> <int> <dbl> <fct> <fct>
## 1 Aruba 1960 4.82 Americas Latin America & Caribbean
## 2 Aruba 1961 4.66 Americas Latin America & Caribbean
## 3 Aruba 1962 4.47 Americas Latin America & Caribbean
## 4 Aruba 1963 4.27 Americas Latin America & Caribbean
## 5 Aruba 1964 4.06 Americas Latin America & Caribbean
## 6 Aruba 1965 3.84 Americas Latin America & Caribbean
We can see it is converted into a tibble
Step 27 : Plotting Fertility rates by differentiating by continent name
ggplot(data = indata) + geom_point(mapping = aes(x=Year, y = Fertility.Rates, color = Continent.Name), position = "jitter", alpha = 0.6) + labs(x = "Year", y = "Fertility Rates", title = "Fertility Rates over the years", subtitle = "Global fertility rates have decreased since 1960")
The fertility rates of different continents can be seen on yearly basis can be observed with the color difference.
Step 28 : Plotting the counties in Middle East & North Africa region and Asian continent
indata %>% filter(Region.Name == "Middle East & North Africa" & Continent.Name == "Asia") %>% ggplot() + geom_line(mapping = aes(x =Year, y = Fertility.Rates, color = Country.Name), size = 1, linetype = 2)
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
The fertility rates of different Countries can be seen year wise
Step 29 : Plot a Boxplot of different continent Fertility rates
ggplot(data = indata, mapping = aes(x = Continent.Name, y = Fertility.Rates)) + geom_boxplot() + coord_flip()
We can see the fertility rates of different continents
Step 30 : Plot a Histogram where x axis is Fertility Rates and y-axis is Count
ggplot(data = indata) + geom_histogram(mapping = aes(x = Fertility.Rates), binwidth = 0.5)
From the above output we can see that fertility rates with the count was plotted
Step 31 : calculating average and plotting the average of different regions by using Bar graph
indata %>% group_by(Region.Name) %>% summarise(avg = mean(Fertility.Rates)) %>% ggplot(mapping = aes(x = reorder(Region.Name, -avg), y = avg, fill=Region.Name)) + geom_bar(stat = "identity") + coord_flip()
We can observe the difference between average of different regions in ascending order